In this week’s lab, the main goal is to gain some experience in building models to explore and explain data. We will start with the famous gapminder data, and use regression models to study temporal trends in life expectancy across the globe. Then we will use decision trees to build a spam filter, using data collected by Dr Cook and her students a number of years ago.
```
Received: by 10.103.136.194 with HTTP; Sun, 17 Sep 2017 14:57:50 -0700 (PDT)
In-Reply-To: <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
References: <6A89C7A8-CA54-42BE-938F-CF41CCE2F362@monash.edu> <CAFvWOFKt9C-WYAWi0-QfA_0x+ej=5DSLsPoPY4NVh29Y=sDf8w@mail.gmail.com>
From: David Frazier <david.frazier@monash.edu>
Date: Mon, 18 Sep 2017 07:57:50 +1000
Message-ID: <CAFvWOF+i6U=tFsb2v+2yQ1L91zXXcusSLKBe=XJwHXdY-7JZJQ@mail.gmail.com>
Subject: Re: formula sheet
To: Dianne Cook <dicook@monash.edu>
Content-Type: multipart/mixed; boundary="001a114fcb3aabdccf055969b77e"

--001a114fcb3aabdccf055969b77e
Content-Type: multipart/alternative; boundary="001a114fcb3aabdccd055969b77c"

--001a114fcb3aabdccd055969b77c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi There,
...
Cheers,
David
```
Open your project for this class. Make sure all your work is done relative to this project.
Open the lab10.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.
The data contains life expectancy, population and GDP per capita for 142 countries, reported every 5 years between 1952 and 2007.
```
Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
```
The plot of all the countries is really hard to explain; it’s very messy. There is generally some increasing trend in the lines, but some lines have big drops. The data begins in the early 1950s, so for model fitting we are going to shift year to start at 1950, which makes the intercept easier to interpret.
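The shift is a simple transformation, e.g. with dplyr (a sketch, assuming the data is in a data frame called gapminder, as in the gapminder package):

```r
library(dplyr)
library(gapminder)

# Shift year so that 0 corresponds to 1950; the intercept of a model
# in year1950 is then the (fitted) life expectancy at 1950.
gapminder <- gapminder %>%
  mutate(year1950 = year - 1950)
```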
Then let’s fit a model for Australia:
```
# A tibble: 6 x 7
    country continent  year lifeExp      pop gdpPercap year1950
     <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>    <dbl>
1 Australia   Oceania  1952   69.12  8691212  10039.60        2
2 Australia   Oceania  1957   70.33  9712569  10949.65        7
3 Australia   Oceania  1962   70.93 10794968  12217.23       12
4 Australia   Oceania  1967   71.10 11872264  14526.12       17
5 Australia   Oceania  1972   71.93 13177000  16788.63       22
6 Australia   Oceania  1977   73.49 14074100  18334.20       27
```
```
Call:
lm(formula = lifeExp ~ year1950, data = oz)

Coefficients:
(Intercept)     year1950
    67.9451       0.2277
```
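The fit above can be reproduced along these lines (a sketch; oz is assumed to hold the Australia subset of the shifted data):

```r
library(dplyr)
library(gapminder)

# Subset to Australia, with year shifted so 0 = 1950
oz <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  filter(country == "Australia")

oz_lm <- lm(lifeExp ~ year1950, data = oz)
oz_lm
```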
(2pts) Interpret the model. (This means explain how life expectancy changes over years, since 1950, using the parameter estimates of the model.) From a life expectancy of 67.9 in 1950, it has increased by about 0.23 years per year, i.e. roughly 2.3 years for every decade.
(1pt) What was the average life expectancy in 1950? 67.9
(1pt) What was the average life expectancy in 2000? 79.3
(1pt) By how much did average life expectancy change over those 50 years? About 11.4 years (0.2277 × 50).
We can get various diagnostics out for the model with the broom package: the parameter estimates and their significance, the goodness of fit statistics, and model diagnostics.
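For example (a sketch, assuming the Australia model is stored in an object called oz_lm):

```r
library(broom)

tidy(oz_lm)    # parameter estimates and their significance
glance(oz_lm)  # goodness-of-fit statistics, e.g. r.squared
augment(oz_lm) # observation-level diagnostics, including .fitted and .resid
```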
(1pt) What column of the diagnostics contains the (a) fitted values, (b) residuals? .fitted and .resid respectively.
Now we are going to fit a simple linear model separately to every country, and use the model fits to simplify the patterns across the globe, so that we can explain the changes in life expectancy.
This code will compute the models for you:
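One way to sketch that computation, fitting the model country by country with purrr (variable names as above; the exact code used in class may differ):

```r
library(dplyr)
library(purrr)
library(tibble)
library(gapminder)

country_coefs <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  split(.$country) %>%
  map_df(function(d) {
    fit <- lm(lifeExp ~ year1950, data = d)
    tibble(country   = as.character(d$country[1]),
           continent = as.character(d$continent[1]),
           intercept = coef(fit)[[1]],
           year1950  = coef(fit)[[2]])
  })
```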
| country | continent | intercept | year1950 |
|---|---|---|---|
| Afghanistan | Asia | 29.35664 | 0.2753287 |
| Albania | Europe | 58.55976 | 0.3346832 |
| Algeria | Africa | 42.23641 | 0.5692797 |
| Angola | Africa | 31.70797 | 0.2093399 |
| Argentina | Americas | 62.22502 | 0.2317084 |
| Australia | Oceania | 67.94507 | 0.2277238 |
```
# A tibble: 1 x 4
    country continent intercept  year1950
     <fctr>    <fctr>     <dbl>     <dbl>
1 Australia   Oceania  67.94507 0.2277238
```
It is also possible to use a for loop to compute the slope and intercept for each country.
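A for-loop version might look like this (a sketch, filling a pre-allocated data frame):

```r
library(gapminder)

countries <- unique(as.character(gapminder$country))
country_coefs <- data.frame(country   = countries,
                            continent = NA_character_,
                            intercept = NA_real_,
                            year1950  = NA_real_)
for (i in seq_along(countries)) {
  sub <- subset(gapminder, country == countries[i])
  fit <- lm(lifeExp ~ I(year - 1950), data = sub)
  country_coefs$continent[i] <- as.character(sub$continent[1])
  country_coefs$intercept[i] <- coef(fit)[[1]]
  country_coefs$year1950[i]  <- coef(fit)[[2]]
}
```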
Do a hand-sketch of the fitted model for your country, using its intercept and slope from the country_coefs data frame. Various answers, but the sketch should show the correct intercept and slope, with axes marked and labelled.

Using the plotly package, find out which countries had a negative slope. Rwanda, Zambia, Zimbabwe

(2pts) Statistically summarise the relationship between intercept and slope, using words like no association, positive linear association, negative linear association, weak, moderate, strong, outliers, clusters. The association is negative, moderate and linear, with some clustering by continent.
(2pts) Do you see a difference between continents? If so, explain what you see. Africa shows the lowest intercepts and most variation in the slope. Europe is high on intercept and low on slope. Asia, like the Americas, is varied on intercept but relatively high on slope.
(2pts) What does it mean for a country to have a high intercept, e.g. 70? The life expectancy in 1950 was quite high, e.g. 70 years.
(2pts) What does it mean for a country to have a high slope, e.g. 0.7? The country had a dramatic increase in life expectancy over the years. A value of 0.7 means life expectancy increased by 7 years for every decade.
Now we are going to examine the fit for each country. We might expect that a linear model is a better fit for some countries and not so good for other countries. Here is the code to extract the model diagnostics for each country’s model.
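A sketch of that extraction, applying broom::glance to each country’s model (names assumed, as above):

```r
library(dplyr)
library(purrr)
library(broom)
library(gapminder)

country_fit <- gapminder %>%
  mutate(year1950 = year - 1950) %>%
  split(.$country) %>%
  map_df(~ glance(lm(lifeExp ~ year1950, data = .x)), .id = "country")

# Countries with a low r.squared are poorly described by a straight line
country_fit %>% arrange(r.squared) %>% select(country, r.squared)
```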
Or you can use a for loop to compute this.
Each of these countries has a big dip in their life expectancy during the time of the study. Explain these using world history and current affairs information. (Feel free to google for news stories.) Civil wars and AIDS
The file SPAM-503.csv contains summaries of a week’s worth of emails from 19 people. Each email was manually labelled as spam or not. We decided to examine our emails because the university had recently changed its spam filters, and emails from the university president were being sent to spam. Spam filters have improved dramatically in the last decade, but something happened to Monash mail this past week: emails from Monash students and departmental colleagues have been discovered in the spam folder. Here is a description of the variables:
1. ISUid: ISU id
2. id: e-mail id (a count from 1 to the number of mails received, so that you can get back to the original message for a line of data, to help with checking strange results)
3. Day of Week: Sun, Mon, Tue, Wed, Thu, Fri, Sat
4. Time of Day: 0-23 (integer values only)
5. Size [kb]: size of the e-mail in kilobytes
6. Box: is the sender in any of my Inboxes or Outbox (i.e. known to you): yes, no
7. Domain: domain name of the sender's e-mail address (last segment only): .edu, .net, .com, .org, .gov, .mil, .de, .fr, .ru
8. Local: sender's e-mail is in the local domain, i.e. xx@yy.iastate.edu: yes, no
9. Digits: number of digits (0-9) in the sender's name, e.g. lottery2003@yahoo.com scores 4
10. name: whether the Name field is a full name, a single word, or empty: e.g. "Andreas Buja <andreas@research.att.com>" is name, "bob <lottery2003@yahoo.com>" is single, "<lottery2003@yahoo.com>" is empty
11. %capital: percentage of capital letters in the subject line
12. NSpecial: number of special characters (i.e. not a-z, A-Z or 0-9) in the subject

    Spam words in the subject line:

13. credit: mortgage, sale, approve, credit -> yes/no
14. sucker: earn, free, save -> yes/no
15. porn: nude, sex, enlarge, improve -> yes/no
16. chain: pass, forward, help -> yes/no
17. username: is your username/name listed in the subject line -> yes/no
18. Large text in e-mail: yes, no (yes only if an html e-mail with size="+3" or size="5" or higher; visual inspection of the e-mail will tell)
19. Probability of being spam, according to the ISU spam filter: look for "Probability=x%" in the header of the email, and record the "x", or NA if the message doesn't have a probability. This variable will be used to compare the classification results from our data. (It has a lot of missing values, because not everyone read email through the university mail system.)
20. Extended spam/mail category: commercial -> com, lists -> list, newsletter -> news, ordinary -> ord
21. Spam: yes, no
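A tree like the one examined below can be fitted with rpart, along these lines (a sketch; train and test are assumed to be training/test splits of the spam data, with Spam as the response):

```r
library(rpart)

# Fit a classification tree with the stated control parameters
spam_rp <- rpart(Spam ~ ., data = train, method = "class",
                 control = rpart.control(minsplit = 10, cp = 0.005))

# Confusion tables for the training and test sets
tr_pred <- predict(spam_rp, newdata = train, type = "class")
table(train$Spam, tr_pred)

ts_pred <- predict(spam_rp, newdata = test, type = "class")
table(test$Spam, ts_pred)
```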
(Use minsplit=10 and cp=0.005.)

```
     tr_pred
       no yes
  no  716  15
  yes  15 340
```

```
     ts_pred
       no yes
  no  703  27
  yes  17 338
```
For the training data, the false positive rate is 15/731=0.021, and the false negative rate is 15/355=0.042. On the test set, the false positive rate is 27/730=0.037 and the false negative rate is 17/355=0.048. False positives are worse, because these are real emails being sent to the spam folder.
The terminal nodes of the tree correspond to:

1. Emails from Category com.
2. Emails from another Category that have a sucker key word in the Subject.
3. Emails from another Category, without a sucker key word in the Subject, with more than 5.5 digits in the sender's name.
4. Emails from another Category, without a sucker key word in the Subject, with fewer than 5.5 digits in the sender's name, from a com or net Domain, not in the user's inbox, and larger than 9.5kb.
5. Emails from another Category, without a sucker key word in the Subject, with fewer than 5.5 digits in the sender's name, from a com or net Domain, not in the user's inbox, smaller than 9.5kb, arriving on a weekend.
6. Emails from another Category, without a sucker key word in the Subject, with fewer than 5.5 digits in the sender's name, from a com or net Domain, not in the user's inbox, smaller than 9.5kb but larger than 1.5kb.
7. Emails from another Category, without a sucker key word in the Subject, with fewer than 5.5 digits in the sender's name, from a com or net Domain, not in the user's inbox, smaller than 1.5kb, where the sender has no name.